Abstract. We propose a new approach for calculating the Lempel-Ziv factorization of a string,
based on run length encoding (RLE). We present a conceptually simple off-line algorithm based
on a variant of suffix arrays, as well as an on-line algorithm based on a variant of directed acyclic
word graphs (DAWGs). Both algorithms run in $O(N + n \log n)$ time and $O(n)$ extra space, where
$N$ is the size of the string and $n \le N$ is the number of RLE factors. The time dependency on $N$ is only
in the conversion of the string to RLE, which can be computed very efficiently in $O(N)$ time and
$O(1)$ extra space (excluding the output). When the string is compressible via RLE, i.e., $n = o(N)$,
our algorithms are, to the best of our knowledge, the first algorithms which require only $o(N)$ extra
space while running in $o(N \log N)$ time.



1 Introduction 



The run-length encoding (RLE) of a string $S$ is a natural encoding of $S$, where each maximal run of
character $a$ of length $p$ in $S$ is encoded as $a^p$, e.g., the RLE of string aaaabbbaa is $a^4b^3a^2$. Since RLE can
be regarded as a compressed representation of strings, it is possible to reduce the processing time and
working space if RLE strings are not decompressed while being processed. Many efficient algorithms that
deal with RLE versions of classical problems on strings have been proposed in the literature (e.g.: exact
pattern matching [1,13,0, approximate matching [J26|, edit distance [1, H, 13, 24|, longest common
subsequence [17, 25], rank/select structures [2^, palindrome detection (ill). In this paper, we consider
the problem of computing the Lempel-Ziv factorization (LZ factorization) of a string via RLE.

The LZ factorization (and its variants) of a string [ij, IH, HHj, discovered over 30 years ago, captures
important properties concerning repeated occurrences of substrings in the string, and has applications
in the field of data compression, as well as being the key component to various efficient algorithms on
strings [16, 20]. Therefore, there exists a large amount of work devoted to its efficient computation. A
naive algorithm that computes the longest common prefix with each of the $O(N)$ previous positions
only requires $O(1)$ space (excluding the output), but can take $O(N^2)$ time, where $N$ is the length of the
string. Using string indices such as suffix trees [13] and on-line algorithms to construct them [s^], the
LZ factorization can be computed in an on-line manner in $O(N \log |\Sigma|)$ time and $O(N)$ space, where $|\Sigma|$
is the size of the alphabet. Most recent algorithms P, [3, [13, E ESl first construct the suffix array [27] of
the string, consequently taking $O(N)$ extra space and at least $O(N)$ time, and are off-line.

Since the most efficient algorithms run in worst-case linear time and are practical, it may seem that
not much better can be achieved. However, a theoretical interest is whether or not we can achieve even
faster algorithms, at least in some specific cases. In this paper, we propose a new approach for calculating
the Lempel-Ziv factorization of a string, which is based on its RLE. The contributions of this paper are
as follows: We first show that the size of the LZ encoding with self-references (i.e., allowing a previous
occurrence of a factor to overlap with the factor itself) is at most twice as large as the size of its RLE. We then
present two algorithms that compute the LZ factorizations of strings given in RLE: an off-line algorithm
based on suffix arrays for RLE strings, and an on-line algorithm based on directed acyclic word graphs
(DAWGs) [3] for RLE strings. Given an RLE string of size $n$, both algorithms work for general ordered
alphabets, and run in $O(n \log n)$ time and $O(n)$ space. Since the conversion from a string of size $N$ to its
RLE can be conducted very efficiently in $O(N)$ time and $O(1)$ extra space (excluding the output), the
total complexity is $O(N + n \log n)$ time and $O(n)$ space.

In the worst case, the string is not compressible by RLE, i.e., $n = N$. Thus, for integer alphabets, our
approach can be slightly slower than the fastest existing algorithms which run in $O(N)$ time (off-line)
or $O(N \log |\Sigma|)$ time (on-line). However, for general ordered alphabets, the worst-case complexities of
our algorithms match previous algorithms since the construction of the suffix array/suffix tree can take
$O(N \log N)$ time if $|\Sigma| = O(N)$. The significance of our approach is that it allows for improvements
in the computational complexity of calculating the LZ factorization for a non-trivial family of strings:
strings that are compressible via RLE. If $n = o(N)$, our algorithms are, to the best of our knowledge,
the first algorithms which require only $o(N)$ extra space while running in $o(N \log N)$ time.

Related Work. For computing the LZ78 [s^ factorization, a sub-linear time and space algorithm was
presented in [l8| . In this paper, we consider a variant of the more powerful LZ77 [sH i factorization.
Two space efficient on-line algorithms for LZ factorization based on succinct data structures have been
proposed [sSHlj. The first runs in $O(N \log^3 N)$ time and $N \log |\Sigma| + o(N \log |\Sigma|) + O(N)$ bits of space H^,
and the other runs in $O(N \log N)$ time with $O(N \log |\Sigma|)$ bits of space [3l|. Succinct data structures
basically simulate accesses to their non-succinct counterparts using less space at the expense of speed. A
notable distinction of our approach is that we aim to reduce the problem size via compression, in order
to improve both time and space efficiency.



2 Preliminaries 

Let $\mathcal{N}$ be the set of non-negative integers. Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string.
The length of a string $S$ is denoted by $|S|$. The empty string $\varepsilon$ is the string of length 0, namely, $|\varepsilon| = 0$.
For any string $S \in \Sigma^*$, let $\sigma_S$ denote the number of distinct characters appearing in $S$. Let $\Sigma^+ = \Sigma^* - \{\varepsilon\}$.
For a string $S = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $S$, respectively. The set
of prefixes of $S$ is denoted by $Prefix(S)$. The longest common prefix of strings $X, Y$, denoted $lcp(X, Y)$,
is the longest string in $Prefix(X) \cap Prefix(Y)$. The $i$-th character of a string $S$ is denoted by $S[i]$ for
$1 \le i \le |S|$, and the substring of a string $S$ that begins at position $i$ and ends at position $j$ is denoted
by $S[i..j]$ for $1 \le i \le j \le |S|$. For convenience, let $S[i..j] = \varepsilon$ if $j < i$.

For any character $a \in \Sigma$ and $p \in \mathcal{N}$, let $a^p$ denote the concatenation of $p$ $a$'s, e.g., $a^1 = a$, $a^2 = aa$,
and so on. $p$ is said to be the exponent of $a^p$. Let $a^0 = \varepsilon$.

Our model of computation is the word RAM with the computer word size at least $\lceil \log_2 |S| \rceil$ bits, and
hence, standard instructions on values representing lengths and positions of string $S$ can be executed in
constant time. Space complexities will be determined by the number of computer words (not bits).



2.1 LZ Encodings 

LZ encodings are dynamic dictionary based encodings with many variants. As in most recent work, we
describe our algorithms with respect to a well known variant called s-factorization jl3| in order to simplify
the presentation.

Definition 1 (s-factorization jl3l|). The s-factorization of a string $S$ is the factorization $S = f_1 \cdots f_n$
where each s-factor $f_i \in \Sigma^+$ ($i = 1, \ldots, n$) is defined inductively as follows: $f_1 = S[1]$. For $i \ge 2$: if
$S[|f_1 \cdots f_{i-1}| + 1] = c \in \Sigma$ does not occur in $f_1 \cdots f_{i-1}$, then $f_i = c$. Otherwise, $f_i$ is the longest prefix
of $f_i \cdots f_n$ that occurs at least twice in $f_1 \cdots f_i$.

Note that each s-factor can be represented in constant size, i.e., either as a single character or a pair of 
integers representing the position of a previous occurrence of the factor and its length. For example, the
s-factorization of the string $S$ = abaabababaaaaabbabab is a, b, a, aba, baba, aaaa, b, babab. This can
be represented as a, b, (1, 1), (1, 3), (5, 4), (10, 4), (2, 1), (5, 5). In this paper, we will focus on describing
algorithms that output only the length of each factor of the s-factorization, but it is not difficult to 
modify them to output the previous position as well, in the same time and space complexities. 
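
Definition 1 can be checked against a brute-force reference. The Python sketch below is an illustration only, not one of the algorithms of this paper: it compares the current suffix with every earlier start position (quadratic time, overlapping occurrences allowed), and the name `s_factorization` is hypothetical.

```python
def s_factorization(s: str) -> list[str]:
    """Brute-force s-factorization (self-references allowed), O(N^2) time."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        # Try every earlier start j < i; the earlier occurrence may overlap s[i:].
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        # A character with no earlier occurrence forms a factor of length 1.
        step = max(best, 1)
        factors.append(s[i:i + step])
        i += step
    return factors


factors = s_factorization("abaabababaaaaabbabab")
print(factors)                    # ['a', 'b', 'a', 'aba', 'baba', 'aaaa', 'b', 'babab']
print([len(f) for f in factors])  # [1, 1, 1, 3, 4, 4, 1, 5]
```

The printed factors agree with the example given in the text for this string.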



2.2 Run Length Encoding 

Definition 2. The Run-Length (RL) factorization of a string $S$ is the factorization $f_1, \ldots, f_n$ of $S$ such
that for every $i = 1, \ldots, n$, factor $f_i$ is the longest prefix of $f_i \cdots f_n$ with $f_i \in \{a^p \mid a \in \Sigma, p > 0\}$.

Note that each factor $f_i$ can be written as $f_i = a_i^{p_i}$ for some character $a_i \in \Sigma$ and some integer
$p_i > 0$, and for any consecutive factors $f_i = a_i^{p_i}$ and $f_{i+1} = a_{i+1}^{p_{i+1}}$, we have that $a_i \ne a_{i+1}$. The run length
encoding (RLE) of a string $S$, denoted $RLE_S$, is a sequence of pairs consisting of a character $a_i$ and an
integer $p_i$, representing the RL factorization. The size of $RLE_S$ is the number of RL factors in $RLE_S$ and
is denoted by $size(RLE_S)$, i.e., if $RLE_S = a_1^{p_1} a_2^{p_2} \cdots a_n^{p_n}$, then $size(RLE_S) = n$. $RLE_S$ can be computed
in $O(N)$ time and $O(1)$ extra space (excluding the $O(n)$ space for output), where $N = |S|$, simply by
scanning $S$ from beginning to end, counting the exponent of each RL factor. Also, noticing that each RL
factor consists of a single repeated character, we have $\sigma_S \le n$.
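
The single left-to-right scan described above can be sketched as follows. This is a minimal illustration in Python; the list-of-pairs representation and the names `rle` and `val` are our assumptions, not notation fixed by the paper.

```python
def rle(s: str) -> list[tuple[str, int]]:
    """Run-length encode s with one scan: a list of (character, exponent) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            # Same character as the current run: increase its exponent.
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            # A different character starts a new RL factor.
            runs.append((c, 1))
    return runs


def val(runs: list[tuple[str, int]]) -> str:
    """Decompress an RLE string back to the plain string."""
    return "".join(c * p for c, p in runs)


print(rle("aaaabbbaa"))  # [('a', 4), ('b', 3), ('a', 2)]
```

Apart from the output list, the scan only keeps the current character and its count, which matches the $O(1)$ extra space claim.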

Let $val$ be the function that "decompresses" $RLE_S$, i.e., $val(RLE_S) = S$. For any $1 \le i \le j \le n$,
let $RLE_S[i..j] = a_i^{p_i} a_{i+1}^{p_{i+1}} \cdots a_j^{p_j}$. For convenience, let $RLE_S[i..j] = \varepsilon$ if $i > j$. Let $RLE\_Substr(S) = \{RLE_S[i..j] \mid 1 \le i, j \le n\}$ and $RLE\_Suffix(S) = \{RLE_S[i..n] \mid 1 \le i \le n\}$.

The following simple but nice observation allows us to represent the complexity of our algorithms in
terms of $size(RLE_S)$.

Lemma 1. For a given string $S$, let $n_{RL}$ and $n_{LZ}$ respectively be the number of factors in its RL
factorization and s-factorization. Then, $n_{LZ} \le 2 n_{RL}$.

Proof. Consider an s-factor that starts at the $j$th position in some RL-factor $a_i^{p_i}$ where $1 < j \le p_i$. Since
$a_i^{p_i - j + 1}$ is both a suffix and a prefix of $a_i^{p_i}$, we have that the s-factor extends at least to the end of $a_i^{p_i}$.
This implies that a single RL-factor is always covered by at most 2 s-factors, thus proving the lemma.
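
As a sanity check of Lemma 1 (not a substitute for the proof), the bound can be tested exhaustively on short strings. The sketch below is ours; it counts RL factors with `itertools.groupby` and s-factors by brute force, and the helper names are hypothetical.

```python
from itertools import groupby, product


def count_rl_factors(s: str) -> int:
    """Number of maximal runs of equal characters."""
    return sum(1 for _ in groupby(s))


def count_s_factors(s: str) -> int:
    """Number of s-factors, found by brute force (self-references allowed)."""
    count, i, n = 0, 0, len(s)
    while i < n:
        best = 0
        for j in range(i):
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        i += max(best, 1)
        count += 1
    return count


def lemma1_holds(max_len: int = 10, alphabet: str = "ab") -> bool:
    """Check n_LZ <= 2 * n_RL for every string over alphabet up to max_len."""
    return all(
        count_s_factors(w) <= 2 * count_rl_factors(w)
        for n in range(1, max_len + 1)
        for w in map("".join, product(alphabet, repeat=n))
    )


print(lemma1_holds())
```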

Note that for LZ factorization variants without self-references, the size of the output LZ encoding may
come into play, when it is larger than $O(size(RLE_S))$.

2.3 Priority Search Trees 

In our LZ factorization algorithms, we will make use of the following data structure, which is essentially an
elegant mixture of a priority heap and a balanced search tree.

Theorem 1 (McCreight [28]). For a dynamic set $D$ which contains $n$ ordered pairs of integers, the
priority search tree (PST) data structure supports all the following operations and queries in $O(\log n)$
time, using $O(n)$ space:

- Insert(x, y): Insert a pair $(x, y)$ into $D$;

- Delete(x, y): Delete a pair $(x, y)$ from $D$;

- MinXInRectangle(L, R, B): Given three integers $L \le R$ and $B$, return the pair $(x, y) \in D$ with
minimum $x$ satisfying $L \le x \le R$ and $y \ge B$;

- MaxXInRectangle(L, R, B): Given three integers $L \le R$ and $B$, return the pair $(x, y) \in D$ with
maximum $x$ satisfying $L \le x \le R$ and $y \ge B$;

- MaxYInRange(L, R): Given two integers $L \le R$, return the pair $(x, y) \in D$ with maximum $y$ satisfying
$L \le x \le R$.
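
To make the semantics of the five operations concrete, here is a deliberately naive stand-in in Python. It is not McCreight's priority search tree: every query scans all pairs in $O(n)$ time, so it only illustrates what each operation returns. The class and method names are hypothetical.

```python
class NaivePST:
    """Set-backed stand-in with the PST interface; queries cost O(n), not O(log n)."""

    def __init__(self):
        self.pairs = set()

    def insert(self, x, y):
        self.pairs.add((x, y))

    def delete(self, x, y):
        self.pairs.discard((x, y))

    def min_x_in_rectangle(self, L, R, B):
        # Pair with minimum x among those with L <= x <= R and y >= B, or None.
        hits = [(x, y) for x, y in self.pairs if L <= x <= R and y >= B]
        return min(hits, default=None)

    def max_x_in_rectangle(self, L, R, B):
        # Pair with maximum x among those with L <= x <= R and y >= B, or None.
        hits = [(x, y) for x, y in self.pairs if L <= x <= R and y >= B]
        return max(hits, default=None)

    def max_y_in_range(self, L, R):
        # Pair with maximum y among those with L <= x <= R, or None.
        hits = [(x, y) for x, y in self.pairs if L <= x <= R]
        return max(hits, key=lambda pair: pair[1], default=None)


pst = NaivePST()
for pair in [(1, 5), (3, 2), (4, 7), (6, 1)]:
    pst.insert(*pair)
print(pst.min_x_in_rectangle(2, 6, 2))  # (3, 2)
print(pst.max_x_in_rectangle(1, 6, 2))  # (4, 7)
print(pst.max_y_in_range(1, 3))         # (1, 5)
```

A real PST would answer the same queries in $O(\log n)$ time, which is what the complexity bounds in the following sections rely on.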

3 Off-line LZ Factorization based on RLE 

In this section we present our off-line algorithm for s-factorization. The term off-line here implies that
the input string $S$ of length $N$ is first converted to a sequence of RL factors, $RLE_S = a_1^{p_1} a_2^{p_2} \cdots a_n^{p_n}$. In
the algorithm which follows, we introduce and utilize RLE versions of classic string data structures.

3.1 RLE Suffix Arrays 

Let $\Sigma_{RLE_S} = \{RLE_S[i] \mid i = 1, \ldots, n\}$. For instance, if $RLE_S$ = a^b^a^b^a^b^a*, then $\Sigma_{RLE_S}$ =
{a^, a'', a^, b^}. For any $a_i^{p_i}, a_j^{p_j} \in \Sigma_{RLE_S}$, let the order $\prec$ on $\Sigma_{RLE_S}$ be defined as $a_i^{p_i} \prec a_j^{p_j} \iff a_i < a_j$,
or $a_i = a_j$ and $p_i < p_j$. The lexicographic ordering on $RLE\_Suffix(S)$ is defined over the order on
$\Sigma_{RLE_S}$, and our RLE version of suffix arrays [27] is defined based on this order:

Definition 3 (RLE suffix arrays). For any string $S$, its run length encoded suffix array, denoted
$RLE\_SA_S$, is an array of length $n = size(RLE_S)$ such that for any $1 \le i \le n$, $RLE\_SA_S[i] = j$ when
$RLE_S[j..n]$ is the lexicographically $i$-th element of $RLE\_Suffix(S)$.
Let $SparseSuffix(S) = \{val(s) \mid s \in RLE\_Suffix(S)\}$, namely, $SparseSuffix(S)$ is the set of "uncompressed"
RLE suffixes of string $S$. Note that the lexicographic order of $RLE\_Suffix(S)$ represented by
$RLE\_SA_S$ is not necessarily equivalent to the lexicographic order of $SparseSuffix(S)$. In the running example,
$RLE\_SA_S = [5, 3, 1, 7, 4, 2, 6]$. However, the lexicographical order for the elements in $SparseSuffix(S)$
is actually $(7, 1, 3, 5, 6, 2, 4)$.
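
The difference between the two orders can be reproduced with a few lines of Python. In the sketch below the exponents of the input are an assumption: they were chosen by us so that both orders agree with the arrays quoted above, and need not be the exponents of the paper's running example. The functions sort naively and do not achieve the $O(n \log n)$ bound; their names are hypothetical.

```python
def rle_suffix_array(runs):
    """1-based RLE suffix array: RLE suffixes compared as sequences of (character, exponent)."""
    n = len(runs)
    return sorted(range(1, n + 1), key=lambda i: runs[i - 1:])


def sparse_suffix_order(runs):
    """1-based order of the same suffixes after decompression (plain string order)."""
    n = len(runs)

    def val(rs):
        return "".join(c * p for c, p in rs)

    return sorted(range(1, n + 1), key=lambda i: val(runs[i - 1:]))


# Assumed exponents (ours), consistent with the two orders stated in the text.
runs = [("a", 3), ("b", 5), ("a", 3), ("b", 5), ("a", 1), ("b", 5), ("a", 4)]
print(rle_suffix_array(runs))     # [5, 3, 1, 7, 4, 2, 6]
print(sparse_suffix_order(runs))  # [7, 1, 3, 5, 6, 2, 4]
```

Python compares tuples character first and exponent second, which coincides with the order $\prec$ on RL factors.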

Lemma 2. Given $RLE_S$ for any string $S \in \Sigma^*$, $RLE\_SA_S$ can be constructed in $O(n \log n)$ time, where
$n = size(RLE_S)$.

Proof. Any two RL factors can be compared in $O(1)$ time, so the lemma follows from algorithms such
as in 0|.

Let $RLE\_RANK_S$ be an array of length $n = size(RLE_S)$ such that $RLE\_RANK_S[j] = i \iff RLE\_SA_S[i] = j$.
Clearly $RLE\_RANK_S$ can be computed in $O(n)$ time provided that $RLE\_SA_S$ is
already computed. To make the notations simpler, in what follows we will denote $rs(h) = RLE\_SA_S[h]$
and $rr(h) = RLE\_RANK_S[h]$ for any $1 \le h \le n$.

For any RLE strings $RLE_X$ and $RLE_Y$ with $val(RLE_X) = X$ and $val(RLE_Y) = Y$, let $lcp(RLE_X, RLE_Y) = lcp(X, Y)$,
i.e., $lcp(RLE_X, RLE_Y)$ is the longest common prefix of the "uncompressed" strings $X$ and
$Y$. It is easy to see that $Z = lcp(RLE_X, RLE_Y)$ can be computed in $O(size(RLE_Z))$ time by a naive
comparison from the beginning of $RLE_X$ and $RLE_Y$, adding up the exponent $p_i$ of the RL factors while
$a_i^{p_i} = RLE_X[i] = RLE_Y[i]$, and possibly the smaller exponent of the first mismatching RL factors,
provided that they are exponents of the same character.
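
The naive comparison just described can be sketched as follows. This is a Python illustration with a hypothetical function name; RLE strings are assumed to be lists of (character, exponent) pairs in which adjacent characters differ.

```python
def rle_lcp_length(x, y):
    """Length of lcp(val(x), val(y)) for two RLE strings x and y."""
    total = 0
    for (a, p), (b, q) in zip(x, y):
        if a != b:
            break            # the first mismatching RL factors use different characters
        total += min(p, q)   # equal factors add p; a mismatch adds the smaller exponent
        if p != q:
            break            # same character, different exponents: the lcp ends here
    return total


x = [("a", 3), ("b", 2), ("a", 5)]
y = [("a", 3), ("b", 2), ("a", 2), ("c", 1)]
print(rle_lcp_length(x, y))  # 7
```

Here the first two RL factors coincide (3 + 2 characters) and the third factors share the character a, so the smaller exponent 2 is added.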

The following two lemmas imply an interesting and useful property of our data structure; although
$RLE\_SA_S$ does not necessarily correspond to the lexicographical order of the uncompressed RLE suffixes,
RLE suffixes that are closer to a given entry in the $RLE\_SA_S$ have longer longest common prefixes with
that entry.

Lemma 3. Let $i, j$ be any integers such that $1 \le i < j \le n$.
(1) For any $j' > j$, $|lcp(RLE_S[rs(i)..n], RLE_S[rs(j)..n])| \ge |lcp(RLE_S[rs(i)..n], RLE_S[rs(j')..n])|$.
(2) For any $i' < i$, $|lcp(RLE_S[rs(i)..n], RLE_S[rs(j)..n])| \ge |lcp(RLE_S[rs(i')..n], RLE_S[rs(j)..n])|$.

Proof. We only show (1). (2) can be shown by similar arguments. Let

$k = \min\{t \mid RLE_S[rs(i)..rs(i) + t - 1] \ne RLE_S[rs(j)..rs(j) + t - 1]\}$ and
$k' = \min\{t' \mid RLE_S[rs(i)..rs(i) + t' - 1] \ne RLE_S[rs(j')..rs(j') + t' - 1]\}$.

Namely, the first $(k - 1)$ RL factors of $RLE_S[rs(i)..n]$ and $RLE_S[rs(j)..n]$ coincide and the $k$th RL
factors differ. The same goes for $k'$, $RLE_S[rs(i)..n]$, and $RLE_S[rs(j')..n]$. Since $j' > j$, $k \ge k'$. If $k > k'$,
then clearly the lemma holds. If $k = k'$, then $RLE_S[rs(i) + k] \prec RLE_S[rs(j) + k] \preceq RLE_S[rs(j') + k]$.
This implies that $|lcp(RLE_S[rs(i) + k], RLE_S[rs(j) + k])| \ge |lcp(RLE_S[rs(i) + k], RLE_S[rs(j') + k])|$.
The lemma holds since for these pairs of suffixes, the RL factors after the $k$th do not contribute to their
lcps.

3.2 LZ factorization using RLE_SA

In what follows we describe our algorithm that computes the s-factorization using $RLE\_SA_S$. Assume
that we have already computed the first $(j - 1)$ s-factors $f_1, f_2, \ldots, f_{j-1}$ of string $S$. Let $\sum_{h=1}^{j-1} |f_h| = \ell - 1$,
i.e., the next s-factor $f_j$ begins at position $\ell$ of $S$. Let $d = \min\{k \mid \sum_{i=1}^{k} p_i \ge \ell\} + 1$, i.e., the $(d - 1)$-th
RL factor $a_{d-1}^{p_{d-1}}$ contains the occurrence of the $\ell$-th character $S[\ell] = a_{d-1}$ of $S$. Let $q = \sum_{i=1}^{d-1} p_i - \ell + 1$,
i.e., $RLE_{S[\ell..N]} = a_{d-1}^{q} a_d^{p_d} \cdots a_n^{p_n}$. Note that $1 \le q \le p_{d-1}$.

The next task is to find the longest previously occurring prefix of $a_{d-1}^{q} a_d^{p_d} \cdots a_n^{p_n}$, which will be $f_j$.
The difficulty here, compared to the non-RLE case, is that $f_j$ will not necessarily begin at positions in
$S$ corresponding to an entry in the $RLE\_SA_S$. A key idea of our algorithm is that rather than looking
directly for $a_{d-1}^{q} a_d^{p_d} \cdots a_n^{p_n}$, we look for the longest previously occurring prefix of $RLE_S[d..n] = a_d^{p_d} \cdots a_n^{p_n}$
whose occurrence is immediately preceded by $a_{d-1}^{q}$, which will have corresponding entries in $RLE\_SA_S$.

To compute $f_j$ in this way, we use the following lemma:

Lemma 4. Assume the situation mentioned above. Let $k = \max(\{p_i \mid a_i = a_{d-1}, 1 \le i \le d - 2\} \cup \{0\})$.
(Case 1) If $q = p_{d-1}$ and $q > k$, then $|f_j| = \max\{k, 1\}$. (Case 2) Otherwise,

$|f_j| = q + \max\{|lcp(RLE_S[rs(x_1)..n], RLE_S[d..n])|, |lcp(RLE_S[d..n], RLE_S[rs(x_2)..n])|\}$

where

$x_1 = \max\{u \mid 1 \le rs(u) < d, u < rr(d), a_{rs(u)-1} = a_{d-1}, p_{rs(u)-1} \ge q\}$ and
$x_2 = \min\{v \mid 1 \le rs(v) < d, rr(d) < v, a_{rs(v)-1} = a_{d-1}, p_{rs(v)-1} \ge q\}$.

Proof. Case (1) is when the new s-factor begins at the beginning of the RL-factor $a_{d-1}^{p_{d-1}}$ and must
end somewhere inside it. Otherwise (Case (2)), $|f_j|$ is at least $q$ since either $q < p_{d-1}$ and $a_{d-1}^{q} = S[\ell..\ell + q - 1] = S[\ell - 1..\ell + q - 2]$
is a prefix of $f_j$ due to the self-referencing nature of s-factorization, or,
there exists an RL factor $a_i^{p_i}$ such that $1 \le i \le d - 2$, $a_{d-1} = a_i$ and $q \le p_i$. Let $f_j = a_{d-1}^{q} X$. Then $X$ is
the longest of the longest common prefixes between $RLE_S[d..n]$ and $RLE_S[h..n]$ for all $1 \le h < d$ that
are immediately preceded by $a_{d-1}^{q}$, i.e., $a_{h-1} = a_{d-1}$ and $p_{h-1} \ge q$. It follows from Lemma 3 that of all
these $h$, the one with the longest lcp with $RLE_S[d..n]$ is either of the entries that are closest to position
$rr(d)$ in the suffix array. The positions $x_1$ and $x_2$ of these entries can be described by the equation in
the lemma statement, and thus the lemma holds.

Once $x_1$ and $x_2$ of the above lemma are determined, our algorithm is similar to conventional non-RLE
algorithms that use the suffix array. The main difficulty lies in computing these values, since, unlike
the non-RLE algorithms, they depend on the value $q$ which is determined only during the s-factorization.
The main result of this section follows.

Theorem 2. Given $RLE_S$ of any string $S \in \Sigma^*$, we can compute the s-factorization of $S$ in $O(n \log n)$
time and $O(n)$ space where $n = size(RLE_S)$.

Proof. First, compute $RLE\_SA_S$ in $O(n \log n)$ time, and $RLE\_RANK_S$ in $O(n)$ time. Next we show how
to compute each s-factor $f_j$ using Lemma 4.

Recall that the s-factor begins somewhere in the $(d - 1)$-th RL factor. We shall maintain, for each
character $a \in \Sigma$, a PST $T_a$ of Theorem 1 for the set of pairs $U_a^d = \{(x, y) \mid x = rr(i), a_{i-1} = a, y = p_{i-1}, 1 \le i \le d - 1\}$,
i.e., the $x$ coordinate is the position in the suffix array, and the $y$
coordinate is the exponent of the preceding character of that suffix. Then, we can easily check whether
the condition of case (1) is satisfied by computing the $k$ of Lemma 4 as $k = MaxYInRange(1, n)$ on
$T_{a_{d-1}}$, which can be computed in $O(\log n)$ time by Theorem 1$^1$. For case (2), we obtain $x_1$ and $x_2$
as: $x_1 = MaxXInRectangle(1, rr(d) - 1, q)$ and $x_2 = MinXInRectangle(rr(d) + 1, n, q)$ in $O(\log n)$ time,
using $T_{a_{d-1}}$. To compute the length of the lcp's, $lcp(RLE_S[rs(x_1)..n], RLE_S[d..n])$ and $lcp(RLE_S[d..n], RLE_S[rs(x_2)..n])$,
we simply use the naive algorithm mentioned previously, that compares each RL
factor from the beginning. Since the longer of the two longest common prefixes is adopted, the number
of comparisons is at most twice the number of RL factors that is spanned by the determined s-factor.
From Lemma 1, the total number of RL factors compared, i.e. $\sum_{j=1}^{n_{LZ}} size(RLE_{f_j})$, is $O(n)$.

After computing the s-factor $f_j$, we update the PSTs. Namely, if $f_j$ spans the RL factors $a_d^{p_d} \cdots a_{d+g}^{p_{d+g}}$,
then we insert pair $(rr(i), p_{i-1})$ into $T_{a_{i-1}}$, for all $d \le i \le d + g$. These insertion operations take $O(g \log n)$
time by Theorem 1, which takes a total of $O(n \log n)$ time for computing all $f_j$. Hence the total time
complexity is $O(n \log n)$.

We analyze the space complexity of our data structure. Notice that a collection of sets $U_a^d$ for all
characters $a \in \Sigma$ are pairwise disjoint, and hence $\sum_{a \in \Sigma} |U_a^d| = d - 1$. By Theorem 1, the overall size
of the PSTs is $O(n)$ at any stage of $d = 1, 2, \ldots, n$. Since $RLE\_SA_S$ and $RLE\_RANK_S$ occupy $O(n)$
space each, we conclude that the overall space requirement of our data structure is $O(n)$.
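
The case analysis of Lemma 4 can be exercised with a small Python sketch. It follows the lemma but makes simplifying assumptions of ours: the RLE suffix array is built by a naive sort, and the PST queries for $k$, $x_1$ and $x_2$ are replaced by linear scans, so the code does not achieve the $O(n \log n)$ bound of Theorem 2. All names are hypothetical and indices are 0-based.

```python
def rle(s):
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs


def rle_lcp(runs, i, j):
    """Length of the uncompressed lcp of the RLE suffixes starting at runs i and j."""
    total = 0
    while i < len(runs) and j < len(runs) and runs[i][0] == runs[j][0]:
        total += min(runs[i][1], runs[j][1])
        if runs[i][1] != runs[j][1]:
            break
        i, j = i + 1, j + 1
    return total


def s_factor_lengths(s):
    runs = rle(s)
    n = len(runs)
    sa = sorted(range(n), key=lambda i: runs[i:])   # naive RLE suffix array
    rank = {run: pos for pos, run in enumerate(sa)}
    starts = [0] * n                                # starts[i]: position of run i in s
    for i in range(1, n):
        starts[i] = starts[i - 1] + runs[i - 1][1]

    lengths, pos, c = [], 0, 0                      # c: run containing pos (the paper's d - 1)
    while pos < len(s):
        while starts[c] + runs[c][1] <= pos:
            c += 1
        a, p = runs[c]
        q = starts[c] + p - pos                     # characters of run c from pos onwards
        # k of Lemma 4 by a linear scan (the paper uses MaxYInRange on a PST).
        k = max((runs[i][1] for i in range(c) if runs[i][0] == a), default=0)
        if q == p and q > k:                        # Case 1
            length = max(k, 1)
        else:                                       # Case 2
            d, best = c + 1, 0
            if d < n:
                def preceded(h):                    # suffix h < d preceded by a run a^{q'}, q' >= q
                    return 1 <= h < d and runs[h - 1][0] == a and runs[h - 1][1] >= q
                for step in (-1, 1):                # nearest qualifying neighbours x1 and x2
                    x = rank[d] + step
                    while 0 <= x < n and not preceded(sa[x]):
                        x += step
                    if 0 <= x < n:
                        best = max(best, rle_lcp(runs, sa[x], d))
            length = q + best
        lengths.append(length)
        pos += length
    return lengths


print(s_factor_lengths("abaabababaaaaabbabab"))  # [1, 1, 1, 3, 4, 4, 1, 5]
```

The output matches the factor lengths of the example in Section 2.1. Only the two nearest qualifying suffix array neighbours are inspected, which is exactly the property provided by Lemma 3.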

4 On-line LZ Factorization based on RLE 

Next, we present an on-line algorithm that computes s-factorization based on RLE. The term on-line
here implies that for a string $S$ of (possibly unknown) length $N$, the algorithm iteratively computes the
output for input string $S[1..i]$ for each $i = 1, \ldots, N$ (the output of $S[1..i]$ can be reused to compute
the output for $S[1..i + 1]$). For example, $RLE_S$ of string $S$ of length $N$ can be computed on-line in a
total of $O(N)$ time and $O(n)$ space (including the output), where $n = size(RLE_S)$. In the description
of our algorithms, this definition will be relaxed for simplicity, and we shall work on $RLE_S$, where the
s-factorization of $val(RLE_S[1..i])$ is iteratively computed for $i = 1, 2, \ldots, n$.

$^1$ Actually, this is an $O(1)$ operation on the PST, since the information for the pair with maximum $y$ in the
entire range of $x$ is contained in the root of the PST.

Fig. 1. Illustration for the RLE DAWG of $a^3b^2a^5b^2a^5c^4a^{10}$. The edges in $E$ are represented by the solid arcs,
while the suffix links of some nodes are represented by dashed arcs (but their labels are omitted). For simplicity
the suffix links of the other nodes are omitted in this figure.

Note that the off-line algorithm described in the previous section cannot be directly transformed to
an on-line algorithm, even if we simulate the suffix array using suffix trees, which can be constructed
on-line [s^. This is because the elements inserted into the PST depended on the lexicographic rank of
each suffix, which can change dynamically in the on-line setting. Nonetheless, we overcome this problem
by taking a different approach, utilizing remarkable characteristics of a string index structure called
directed acyclic word graphs (DAWGs) [6].

4.1 RLE DAWGs 

The DAWG of a string $S$ is the smallest automaton that accepts all suffixes of $S$. Below we introduce an
RLE version of DAWGs: We regard $RLE_S$ as a string of length $n$ over alphabet $\Sigma_{RLE_S} = \{RLE_S[i] \mid i = 1, \ldots, n\}$.
For any $u \in RLE\_Substr(S)$, let $EndPos_{RLE_S}(u)$ denote the set of positions where an
occurrence of $u$ ends in $RLE_S$, i.e., $EndPos_{RLE_S}(u) = \{j \mid u = RLE_S[i..j], 1 \le i \le j \le n\}$ for any
$u \in \Sigma_{RLE_S}^+$ and $EndPos_{RLE_S}(\varepsilon) = \{0, \ldots, n\}$. Define an equivalence relation for any $u, w \in RLE\_Substr(S)$
by $u \equiv_{RLE_S} w \iff EndPos_{RLE_S}(u) = EndPos_{RLE_S}(w)$. The equivalence class of $u \in RLE\_Substr(S)$
w.r.t. $\equiv_{RLE_S}$ is denoted by $[u]_{RLE_S}$. When clear from context, we abbreviate the above notations as
$EndPos$, $\equiv$ and $[u]$, respectively. Note that for any two elements in $[u]$, one is a suffix of the other (or
vice versa). We denote by $\overleftarrow{u}$ the longest member of $[u]$.

Definition 4. The run length encoded DAWG of a string $S \in \Sigma^*$, denoted by $RLE\_DAWG_S$, is the
DAWG of $RLE_S$ over alphabet $\Sigma_{RLE_S} = \{RLE_S[i] \mid i = 1, \ldots, n\}$. Namely, $RLE\_DAWG_S = (V, E)$
where $V = \{[u] \mid u \in RLE\_Substr(S)\}$ and $E = \{([u], a^p, [ua^p]) \mid u, ua^p \in RLE\_Substr(S), u \not\equiv ua^p\}$.

We also define the set $F$ of labeled reversed edges, called suffix links, by $F = \{([a^pu], a^p, [u]) \mid u, a^pu \in RLE\_Substr(S), u = \overleftarrow{u}\}$.
See also Fig. 1 that illustrates $RLE\_DAWG_S$ for $RLE_S = a^3b^2a^5b^2a^5c^4a^{10}$.
Since $EndPos(b^2a^5) = EndPos(a^5) = \{3, 5\}$, $b^2a^5$ and $a^5$ are represented by the same node. On the other
hand, $EndPos(a^3b^2a^5) = \{3\}$ and hence $a^3b^2a^5$ is represented by a different node.
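
The end position sets of this example can be recomputed by brute force. The Python sketch below is only an illustration of the definition of $EndPos$ (it does not build the DAWG); RLE strings are lists of (character, exponent) pairs, positions are counted in RL factors starting from 1, and the function name is hypothetical.

```python
def end_pos(runs, u):
    """Set of positions j such that an occurrence of RLE substring u ends at RL factor j."""
    n, m = len(runs), len(u)
    if m == 0:
        return set(range(n + 1))          # EndPos of the empty string is {0, ..., n}
    # u must match whole RL factors, i.e. it is a string over the RL-factor alphabet.
    return {j for j in range(m, n + 1) if runs[j - m:j] == u}


runs = [("a", 3), ("b", 2), ("a", 5), ("b", 2), ("a", 5), ("c", 4), ("a", 10)]
a5 = [("a", 5)]
b2a5 = [("b", 2), ("a", 5)]
a3b2a5 = [("a", 3), ("b", 2), ("a", 5)]
print(sorted(end_pos(runs, a5)), sorted(end_pos(runs, b2a5)))  # [3, 5] [3, 5]
print(sorted(end_pos(runs, a3b2a5)))                           # [3]
```

Two RLE substrings share a node of the RLE DAWG exactly when these sets coincide, as for $a^5$ and $b^2a^5$ here.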

Lemma 5. Given $RLE_S$ of any string $S$ where $n = size(RLE_S)$, $RLE\_DAWG_S$ has $O(n)$ nodes and
edges, and can be constructed in $O(n \log n)$ time and $O(n)$ extra space in an on-line manner, together
with the suffix link set $F$.

Proof. A simple adaptation of the results from Q. (See Appendix for full proof.)

4.2 On-line LZ factorization using RLE_DAWG 

The high-level structure of our on-line algorithm follows that of the off-line algorithm described in the
beginning of Section 3.2. In order to find the longest previously occurring prefix of $a_{d-1}^{q} a_d^{p_d} \cdots a_n^{p_n}$, which
is the next s-factor, we construct the $RLE\_DAWG_S$ on-line for the string up to $RLE_S[1..d - 1] = a_1^{p_1} a_2^{p_2} \cdots a_{d-1}^{p_{d-1}}$
and use it instead of the $RLE\_SA_S$. The difficulty is, as in the off-line case, that
only the suffixes that start at the beginning of an RL factor are represented in the RLE_DAWG. Therefore,
we again look for the longest previously occurring prefix of $RLE_S[d..n] = a_d^{p_d} \cdots a_n^{p_n}$ that is immediately
preceded by $a_{d-1}^{q}$ in $S$, rather than looking directly for $a_{d-1}^{q} a_d^{p_d} \cdots a_n^{p_n}$. We augment the RLE_DAWG
with some more information to make this possible.

Let $E_{[u]}$ denote the set of out-going edges of node $[u]$. For any edge $e = ([u], b^q, [ub^q]) \in E_{[u]}$ and
each character $a \in \Sigma$, define $mpe_e(a) = \max(\{p \mid a^p \overleftarrow{u} b^q \in RLE\_Substr(S)\} \cup \{0\})$. That is, $mpe_e(a)$
represents the maximum exponent of the RL factor with character $a$ that precedes $\overleftarrow{u} b^q$ in $S$.
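
The quantity $mpe_e(a)$ can be illustrated by brute force on the example of Fig. 1. The Python sketch below is ours and does not use the DAWG: it takes the RLE substring $w = \overleftarrow{u} b^q$ spelled by the edge directly, scans all its occurrences, and returns the largest exponent of an RL factor of character $a$ immediately preceding one of them. The function name is hypothetical.

```python
def mpe(runs, w, a):
    """Maximum exponent p such that RL factor a^p immediately precedes an occurrence of w (0 if none)."""
    m = len(w)
    best = 0
    for start in range(1, len(runs) - m + 1):   # start >= 1 so that a preceding factor exists
        if runs[start:start + m] == w and runs[start - 1][0] == a:
            best = max(best, runs[start - 1][1])
    return best


runs = [("a", 3), ("b", 2), ("a", 5), ("b", 2), ("a", 5), ("c", 4), ("a", 10)]
print(mpe(runs, [("b", 2)], "a"))            # 5
print(mpe(runs, [("b", 2), ("a", 5)], "a"))  # 5
print(mpe(runs, [("b", 2)], "c"))            # 0
```

In the first call, $b^2$ is preceded by $a^3$ and by $a^5$, so the maximum is 5.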

Lemma 6. Given $RLE_S$ of any string $S \in \Sigma^*$, $RLE\_DAWG_S = (V, E)$, augmented so that $mpe_e(a)$
can be computed in $O(\log \sigma_S)$ time for any $e \in E$ and any character $a \in \Sigma$, can be constructed in an
on-line manner in a total of $O(n \log n)$ time with $O(n)$ space.

Proof. When computing $mpe_e(a)$, consider the following cases: (Case 1) $\overleftarrow{u} b^q$ is not the longest member of
$[\overleftarrow{u} b^q]$, i.e., $\overleftarrow{u b^q} \ne \overleftarrow{u} b^q$. For any $j \in EndPos(\overleftarrow{u} b^q)$ let $j' = j - size(\overleftarrow{u} b^q)$. We have that $RLE_S[j'] = a_{j'}^{p_{j'}}$ where
$a_{j'}^{p_{j'}} \overleftarrow{u} b^q \equiv \overleftarrow{u} b^q$, i.e., $\overleftarrow{u} b^q$ is always immediately preceded by $a_{j'}^{p_{j'}}$ in $RLE_S$. Therefore, $mpe_e(a) = p_{j'}$ if
$a_{j'} = a$ and $0$ otherwise. For any node $[v] \in V$, an arbitrary $j \in EndPos(v)$ can be easily determined in
$O(1)$ time when the node is first constructed during the on-line construction of $RLE\_DAWG_S$, and does
not need to be updated.

(Case 2) $\overleftarrow{u} b^q$ is the longest member of $[\overleftarrow{u} b^q]$, i.e., $\overleftarrow{u b^q} = \overleftarrow{u} b^q$. For each occurrence of $a^p \overleftarrow{u b^q} = a^p \overleftarrow{u} b^q \in RLE\_Substr(S)$,
there must exist a suffix link $([a^p \overleftarrow{u} b^q], a^p, [\overleftarrow{u} b^q]) \in F$. Therefore $mpe_e(a)$ is the maximum
of the exponent in the labels of all such incoming suffix links, or $0$ if there are none. By maintaining a
balanced binary search tree at every edge $e$, we can retrieve this value for any $a \in \Sigma$ in $O(\log \sigma_S)$ time. It
also follows from the on-line construction algorithm of $RLE\_DAWG_S$ that the set of labels of incoming
suffix links to a node only increases, and we can update this value in $O(\log \sigma_S)$ time for each new suffix
link. Since $|F| = O(n)$, constructing the balanced binary search trees takes a total of $O(n \log \sigma_S)$ time,
and the total space requirement is $O(n)$.

In order to determine which case applies, it is easy to check whether $\overleftarrow{u} b^q$ is the longest element of
$[\overleftarrow{u} b^q]$ in $O(1)$ time by maintaining the length of the longest path to any given node during the on-line
construction of $RLE\_DAWG_S$. This completes the proof.

Lemma 7. Given $RLE_S$ of any string $S \in \Sigma^*$, $RLE\_DAWG_S = (V, E)$, augmented so that
$\max\{p \mid mpe_e(a) \ge q, e = ([u], b^p, [w]) \in E_{[u]}\}$ can be computed in $O(\log n)$ time for any $e \in E$, character $a \in \Sigma$,
and integer $q > 0$, can be constructed in an on-line manner in a total of $O(n \log n)$ time with $O(n)$ space.

Proof. During the on-line construction of the augmented $RLE\_DAWG_S$ of Lemma 6, we further construct
and maintain a family of PSTs at each node of $RLE\_DAWG_S$ with a total size of $O(n)$, containing the
information to answer the query in $O(n \log n)$ time. (See Appendix for full proof.)

The next lemma shows how the augmented $RLE\_DAWG_S$ can be used to efficiently compute the
longest prefix of a given pattern string that appears in string $S$.

Lemma 8. For any pattern string $P \in \Sigma^*$, let $RLE_P = b_1^{q_1} b_2^{q_2} \cdots b_m^{q_m}$. Given $RLE_P$, we can compute
the length of the longest prefix $P'$ of $P$ that occurs in string $S$ in $O(size(RLE_{P'}) \log n)$ time, using a
data structure of $O(n)$ space, where $n = size(RLE_S)$.

Proof. The outline of the procedure is shown in Algorithm 1. First, we check whether the first RL
factor $b_1^{q_1}$ of $P$ is a substring of $S$ (Line 2). If so, the calculation basically proceeds by traversing
$RLE\_DAWG_S$ with $b_2^{q_2} b_3^{q_3} \cdots$ until there is no outgoing edge with $b_i^{q_i}$ (i.e. $b_2^{q_2} \cdots b_i^{q_i} \notin RLE\_Substr(S)$),
or, there is no occurrence of $b_2^{q_2} \cdots b_i^{q_i}$ that is immediately preceded by $b_1^{q}$, where $q \ge q_1$, in $S$. If
shortcut = false, $b_2^{q_2} \cdots b_{i-1}^{q_{i-1}}$ is the longest element in the node, and the latter check is conducted by the
condition $mpe_e(b_1) \ge q_1$. If shortcut = true, the character preceding any occurrence of $b_2^{q_2} \cdots b_i^{q_i}$ is
uniquely determined and already checked in a previous edge traversal, so no further check is required.

By Lemmas 5, 6 and 7, the length of the longest prefix $P'$ of $P$ that occurs in $S$ can be computed
in $O(size(RLE_{P'})(\log n + \log \sigma_S)) = O(size(RLE_{P'}) \log n)$ time using a data structure of $O(n)$ space.
Algorithm 1: Pattern Matching on RLE_DAWG_S

Input: $RLE\_DAWG_S = (V, E)$, $RLE_P = b_1^{q_1} \cdots b_m^{q_m}$

 1 $h$ := $\max\{q \mid ([\varepsilon], b_1^{q}, [w]) \in E\}$;
 2 if ($m = 1$) or ($h < q_1$) then Output $\min\{h, q_1\}$ and return;
 3 shortcut := false; $v$ := $[\varepsilon]$;
 4 for $i = 2, \ldots, m$ do
 5     if $e = (v, b_i^{q_i}, [w]) \in E_v$ and (shortcut = true or $mpe_e(b_1) \ge q_1$) then
 6         $v$ := $[w]$; // $RLE_P[2..i] \in [w]$
 7         if $RLE_P[2..i]$ is not the longest element of $[w]$ then shortcut := true
 8     else
           // $k$: maximum exponent of $b_i$ such that $val(b_1^{q_1} b_2^{q_2} \cdots b_i^{k}) \in Substr(S)$.
 9         if shortcut = true then $k$ := $\max\{q \mid (v, b_i^{q}, [w]) \in E_v\}$;
10         else $k$ := $\max\{q \mid mpe_e(b_1) \ge q_1, e = (v, b_i^{q}, [w]) \in E_v\}$;
11         Output $|val(b_1^{q_1} b_2^{q_2} \cdots b_i^{\min\{q_i, k\}})|$ and return;
12 Output $|P|$ and return; // P itself occurs in S



Below we give an example for Lemma 8. See Fig. 1 that illustrates $RLE\_DAWG_S$ for $RLE_S = a^3b^2a^5b^2a^5c^4a^{10}$,
and consider searching string $S$ for pattern $P$ with $RLE_P$ = a^b^a^. We start traversing
$RLE\_DAWG_S$ with the second RL factor $b^2$ of $P$. Since there is an out-going edge labeled $b^2$ from the
source node we reach node $v = [b^2]$. There are two suffix links that point to node $v$, $([a^3b^2], a^3, [b^2])$ and
$([a^5b^2], a^5, [b^2])$. Hence $mpe_{([\varepsilon], b^2, [b^2])}(a) = \max\{3, 5\} = 5$, and thus the prefix a^b^ of $P$ occurs in $S$. We
examine whether a longer prefix of $P$ occurs in $S$ by considering the third RL factor a'^. There is no
out-going edge from $v$ that is labeled a^, hence the longest prefix of $P$ that occurs in $S$ is of the form
a^b^a^$\ell$ for some $\ell \ge 0$. We consider the set $E_v(a)$ of out-going edges of $v$ that are labeled $a^q$ for some
$q$, and obtain $E_v(a) = \{([b^2], a^5, [b^2a^5])\}$. We have $mpe_{([b^2], a^5, [b^2a^5])}(a) = \max\{3, 5\} = 5$ due to the two
suffix links pointing to $[b^2a^5]$. Thus, the longest prefix of $P$ that occurs in $S$ is a^b^a"""''^'^^ = a^b^a^.

Theorem 3. Given RLEs for any string S £ S* , the s-factorization of S can be computed in an on-line 
manner in O(nlogn) time and 0{n) extra space, where size{RLE s) = n. 

Proof. Assume the situation described in the first paragraph of Section 13.21 In addition, assume that 
we have constructed RLE_DAWGg~^ , the RLE_DAWG (with augmentations described previously) for 
RLEs[l..d - 1] = af af' • ■ ■ a^J^x ■ By definition, the longest prefix P' of F = S[i..N] such that P' e 
Suhstr{val{a\^ ■ ■ ■ a^''^^)), is a prefix of fj. By LemmalU we can compute \P'\ in 0{size{RLEpi) logd) 
time. 

Notice that by definition of P', f_j can be longer than P' only when P' is a suffix of val(RLE_S[1..d-1]). 
We can check this by a naive matching in O(size(RLE_{P'})) time and, if so, also compute W = 
lcp(RLE_S[d..n], RLE_{S[i+|P'|..N]}), again by a naive matching, in O(size(RLE_W)) time, and obtain |f_j| = 
|P'| + |W|, therefore computing |f_j| in O(size(RLE_{f_j})) time. If f_j spans the RL factors a_d^{p_d} ... a_{d+g}^{p_{d+g}}, we 
update RLE_DAWG_S^{d-1} accordingly in amortized O(g log n) time by Lemma H. Thus, totaling 
for all f_j, we can compute the s-factorization in O(n log n) time. The O(n) space complexity follows from 
Lemmas [5l ini [3 and|51. For any i > 1, the s-factorization of val(RLE_S[1..i-1]) and the s-factorization 
of val(RLE_S[1..i]) differ only in the last 1 or 2 factors. It is easy to see that the s-factorization of 
val(RLE_S[1..i]) is iteratively computed for i = 1, ..., n, and the computation is on-line. 
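
The naive matching on RL factors used in the proof can be illustrated with a short sketch. It is not taken from the paper: it assumes that an RLE is stored as a Python list of (character, run length) pairs whose adjacent characters differ, and the helper name rle_lcp is hypothetical. It returns the length, in characters, of the longest common prefix of the two decoded strings while touching only the RL factors that match.

```python
def rle_lcp(x, y):
    """Length (in characters) of the longest common prefix of two RLE strings.

    x and y are lists of (char, run_length) pairs; adjacent pairs are
    assumed to carry different characters (maximal runs).
    """
    total = 0
    for (c1, p1), (c2, p2) in zip(x, y):
        if c1 != c2:
            break                 # the runs start with different characters
        total += min(p1, p2)      # the shorter run is fully matched
        if p1 != p2:
            break                 # the shorter run is followed by another character
    return total
```

For example, rle_lcp([('a', 3), ('b', 2), ('a', 1)], [('a', 3), ('b', 5)]) compares aaabba with aaabbbbb and stops inside the second run. The loop performs a constant amount of work per matched RL factor, which is the reason the proof can charge this step to size(RLE_W).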



5 Discussion 



We proposed off-line and on-line algorithms that compute a well-known variant of LZ factorization, called 
s-factorization, of a given string S in O(N + n log n) time using only O(n) extra space, where N = |S| 
and n = size(RLE_S). After converting S to RLE_S in O(N) time and O(1) extra space (excluding the 
output), the main part of the algorithms works only on RLE_S, running in O(n log n) time and O(n) space, 
and can therefore be more time and space efficient than previous LZ factorization algorithms 
when the input strings are compressible by RLE. Our algorithms are theoretically significant in that 
they are the only algorithms which achieve o(N log N) time using only o(N) extra space for strings with 
n = o(N), thus offering a substantial improvement to the asymptotic time complexities in calculating the 
s-factorization, for a non-trivial family of strings. Our algorithms can be easily extended to other variants 
of LZ factorization. For example, let m be the size of the s-factorization without self-references of a given 
string. Since Lemma 1 does not hold for s-factorization without self-references, the time complexity of 
the algorithm is O(N + (n + m) log n). The working space remains O(n) (excluding the output). 
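
As a point of reference for the two stages summarized above, the following sketch shows the linear-time conversion to RLE and a deliberately naive s-factorization of the plain string. It is not the algorithm of this paper: s_factorize runs in quadratic time or worse, and it assumes the usual definition with self-references, namely that each factor is the longest prefix of the remaining suffix that also occurs starting at an earlier position (overlap allowed), or a single character if there is no such prefix. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(s):
    """Convert a string to a list of (char, run_length) pairs in one pass."""
    return [(c, len(list(run))) for c, run in groupby(s)]


def s_factorize(s):
    """Naive s-factorization with self-references (reference only)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):                    # every earlier start position
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1                   # the match may run past position i
            best = max(best, length)
        step = max(best, 1)                   # a new character forms its own factor
        factors.append(s[i:i + step])
        i += step
    return factors
```

On the string aaaabbbaa from the Introduction, rle_encode yields the three RL factors a^4 b^3 a^2, and s_factorize yields the factors a, aaa, b, bb, aa.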

Since conventional string data such as natural language texts are not usually compressible via RLE, 
the algorithms in this paper, although theoretically interesting, may not be very practical. However, our 
approach may still have potential practical value for other types of data and objectives. For example, 
a piece of music can be thought of as being naturally expressed in RLE, where the pitch of the tone 
is a character, and the duration of the tone is its run length. Other than for the applications to string 
algorithms [19], [20] mentioned in the Introduction, Lempel-Ziv factorization on such RLE-compressible 
strings can be important, due to an interesting application of compression, including LZ77 (gzip), as a 
measure of distance between data, called Normalized Compression Distance (NCD) [23]. NCD has been 
shown to be effective for various clustering and classification tasks, including MIDI music data, while 
not requiring in-depth prior knowledge of the data [l^, The NCD between two strings S and T 
w.r.t. a compression algorithm basically depends only on the compressed sizes of the strings S, T, and 
their concatenation ST. Therefore, efficiently computing their s-factorizations from RLE_S, RLE_T, and 
RLE_ST would contribute to making the above clustering and classification tasks faster and more space 
efficient. 
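
To make the dependence on the three compressed sizes concrete, here is a minimal sketch of NCD. The formula used, (C(ST) - min(C(S), C(T))) / max(C(S), C(T)), is the form commonly given in the NCD literature and is not spelled out in this paper, so it should be read as an assumption. The default compressor is zlib (DEFLATE, the method behind gzip) as a stand-in; the size parameter is hypothetical and could instead return, for instance, the number of s-factors.

```python
import zlib


def ncd(x, y, size=lambda data: len(zlib.compress(data))):
    """Normalized Compression Distance between two byte strings.

    size maps a byte string to its compressed size; any compressor
    (or a factor count) can be plugged in.
    """
    cx, cy, cxy = size(x), size(y), size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Only three size computations are needed per pair, which is why faster factorization of S, T, and ST translates directly into faster distance computation.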

Our algorithms are based on RLE variants of classical string data structures. However, our approach 
does not necessarily make the use of succinct data structures impossible. It would be interesting to 
explore how succinct data structures can be used in combination with our approach to further improve 
the space efficiency. 
Appendix 

This appendix provides complete proofs that were omitted due to lack of space. 

Lemma m Given RLE_S of any string S where n = size(RLE_S), RLE_DAWG_S has O(n) nodes and 
edges, and can be constructed in O(n log n) time and O(n) extra space in an on-line manner, together 
with the suffix link set F. 

Proof. The proof is a simple adaptation of the results from [61]. The DAWG of a string of length m has 
O(m) nodes and edges. Since RLE_DAWG_S is the DAWG of RLE_S of length n, RLE_DAWG_S clearly 
has O(n) nodes and edges. If σ is the number of distinct characters appearing in S, then the DAWG of 
a string of length m can be constructed in O(m log σ) time and O(m) space, in an on-line manner, using 
suffix links. Since the number of distinct RL factors in RLE_S is at most n, RLE_DAWG_S with F can be 
constructed in O(n log n) time and O(n) extra space, on-line. 
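
The adaptation can be sketched by running the standard on-line DAWG (suffix automaton) construction over the sequence of RL factors, treating each (character, run length) pair as a single symbol. This is an illustration under assumptions, not the paper's implementation: transitions are kept in Python dictionaries rather than balanced search trees, so the stated O(n log n) bound is not reproduced literally, none of the PST augmentations are included, and the function names are hypothetical.

```python
def build_rle_dawg(rle):
    """On-line DAWG over a list of RL factors; returns (edges, suffix links, lengths)."""
    nxt, link, length = [{}], [-1], [0]       # state 0 is the source
    last = 0                                   # current sink
    for sym in rle:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and sym not in nxt[p]:
            nxt[p][sym] = cur                  # edges to the new sink
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][sym]
            if length[p] + 1 == length[q]:     # the edge (p, q) is primary
                link[cur] = q
            else:                              # node split: clone q
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(sym) == q:
                    nxt[p][sym] = clone        # redirect edges to the clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def dawg_contains(nxt, factors):
    """True if the sequence of RL factors labels a path from the source."""
    state = 0
    for sym in factors:
        if sym not in nxt[state]:
            return False
        state = nxt[state][sym]
    return True
```

The sketch recognizes only sequences of whole RL factors. Matching a pattern whose first or last run is shorter than the corresponding run in S is what the augmentation of the next lemma is for.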

Lemma [3 Given RLE_S of any string S ∈ Σ*, RLE_DAWG_S = (V, E), augmented so that 
max{p | mpe_e(a) ≥ q, e = ([u], b^p, [w]) ∈ E} can be computed in O(log n) time for any e ∈ E, character a ∈ Σ, 
and integer q > 0, can be constructed in an on-line manner in a total of O(n log n) time with O(n) space. 

Proof. During the on-line construction of the augmented RLE_DAWG of Lemma [51, we further construct 
a family of PSTs at each node. Let T_{u,a,b} denote the PST at node u ∈ V that contains the set of pairs 
U_{u,a,b} = {(p, mpe_e(a)) | e = ([u], b^p, [w]) ∈ E, mpe_e(a) > 0}, where a, b ∈ Σ. By maintaining a two-level 
balanced binary search tree (a balanced binary search tree inside each node of the first balanced 
binary search tree) at each node, T_{u,a,b} can be accessed for any a, b ∈ Σ in O(log σ) time, where σ is 
the number of distinct characters in S. Note that 
empty PSTs will not be inserted, and hence the total space will be proportional to the number of elements 
contained in all PSTs. Furthermore, the number of elements in a single PST is bounded by O(n), so 
max{p | mpe_e(a) ≥ q, e = ([u], b^p, [w]) ∈ E} can be computed as MaxXInRectangle(1, |S|, q) on T_{u,a,b} 
in O(log n) time by Theorem 1. 
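
The query in this proof can be illustrated with a naive stand-in for the PSTs. The sketch below is an assumption-laden illustration and not the paper's data structure: it stores the pairs of one node in a two-level dictionary keyed by a and then b, scans them linearly instead of answering in O(log n) time, and takes MaxXInRectangle(L, R, B) to mean the largest x with L ≤ x ≤ R among pairs (x, y) with y ≥ B. All names are hypothetical.

```python
def max_x_in_rectangle(pairs, x_lo, x_hi, y_lo):
    """Largest x among pairs (x, y) with x_lo <= x <= x_hi and y >= y_lo, or None."""
    xs = [x for x, y in pairs if x_lo <= x <= x_hi and y >= y_lo]
    return max(xs) if xs else None


def longest_edge_exponent(node_psts, a, b, q, s_len):
    """max{p | mpe_e(a) >= q} over edges labeled b^p leaving one node.

    node_psts maps a -> b -> list of (p, mpe_e(a)) pairs; empty lists
    are never stored, mirroring the remark that empty PSTs are not inserted.
    """
    pairs = node_psts.get(a, {}).get(b)
    if not pairs:
        return None
    return max_x_in_rectangle(pairs, 1, s_len, q)
```

With node_psts = {'a': {'b': [(2, 5), (4, 3), (7, 1)]}}, a query with q = 3 considers only the pairs whose second component is at least 3 and returns the larger of the exponents 2 and 4.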

We now bound the total number of elements in all of the PSTs. Recall that mpeg(a) = max({p 
aPtx6« G RLE.Substr{S)} U {0}). 

When a suffix link pointing to a node [u] is created, and when an outgoing edge of [u] is created, we 
must update the PSTs associated with [u]. An edge ([u], b^p, [v]) is called primary if size(u) + 1 = size(v), 
and is called secondary otherwise. 

First, we consider the updates of the PSTs due to the suffix links. Suffix links are created in the 
following situations (Also refer to ^| for the on-line construction algorithm of DAWGs) : 

1. The suffix link of the sink node is created. 

2. After a node childState is split, then a suffix link from childState to newChildState is created, 
where newChildState is the new node created by the node split. 

Case 1. Let [v] be the node that is pointed to by the suffix link of the sink node. Let e = ([u], b^p, [v]) be 
the primary edge to [v], and a^q be the RL factor which corresponds to the suffix link. If q > mpe_e(a), 
then delete the pair (p, mpe_e(a)) from the corresponding PST stored in [u], and insert a new pair (p, q) 
into the PST. 
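
The update in this case amounts to raising the second component of one pair. A minimal sketch follows, assuming (as an illustration only) that the pairs of one PST are held in a dictionary from the edge exponent p to the current value of mpe_e(a), with absent keys standing for the value 0. The function name is hypothetical.

```python
def update_for_sink_suffix_link(pairs, p, q):
    """Apply the suffix-link update to the pairs of one PST.

    pairs maps an edge exponent p to mpe_e(a); a missing key means 0,
    since pairs with value 0 are not stored.
    """
    if q > pairs.get(p, 0):
        pairs[p] = q          # same effect as deleting (p, old) and inserting (p, q)
    return pairs
```

Each call changes at most one pair, which is what allows the number of PST updates to be charged to the number of created suffix links.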

Case 2. There will be no updates in the PSTs. We will discuss the details in Case 3 of edge creation. 

Next, we consider the updates of the PSTs due to the edges. Edges are created in the following 
situations: 

1. A primary edge from the old sink currentSink to the new sink newSink is created. 

2. A secondary edge to newSink is created. 

3. After a node childState is split, then the secondary edge to childState from one of its parents 
parentState becomes the primary edge from parentState to newChildState. 

4. After a node childState is split, then all the outgoing edges of childState are copied as the outgoing 
edges of newChildState. 

5. After a node childState is split, some secondary edges to childState are redirected and become 
secondary edges to newChildState. 
Fig. 2. Illustration for node split of the RLE_DAWG construction algorithm. Node childState (depicted as 
#2) is split into nodes childState and newChildState (depicted as #3). Note that all the outgoing edges of 
newChildState are secondary edges. Legend: primary edge, secondary edge, suffix link. 

Case 1. Although a new primary edge is created, no pairs are inserted into or deleted from the PST, 
since there are no suffix links to newSink. 

Case 2. Let [u] be the node from which a secondary edge to newSink is created. We then insert a 
pair corresponding to the secondary edge into the PST of [u]. 

Case 3. In this case, the secondary edge becomes a primary edge. So seemingly we might need to 
delete the pair corresponding to the existing secondary edge and insert a new pair corresponding to the 
incoming suffix link. However, both pairs are actually identical, and hence we need no updates in the 
PSTs. 

Case 4. Since the copied edges are all secondary edges (see Fig. 2), similar updates to Case 2 are 
conducted for all the copied edges. 

Case 5. Let e and e' be secondary edges before and after redirection, respectively. By the property 
of the equivalence class, we have that mpe_e(a) = mpe_{e'}(a) for any character a ∈ Σ. Hence we need no 
explicit updates of the PSTs. 

By the above discussion, the number of update operations can be bounded by the number of added 
edges and suffix links. Since the total number of edges and suffix links is O(n), the total number of 
pairs in all of the PSTs, that is, the sum of |U_{u,a,b}| over all [u] ∈ V and a, b ∈ Σ, is also O(n). The total 
time complexity for the updates is O(n log n). 